Welcome to a data analysis case study on Cyclistic’s bike sharing. In this case study, I will work as a junior data analyst for a fictitious organization called Cyclistic. The company’s stakeholder believes that the company’s future success depends on the annual membership subscription rate. As a junior data analyst working in the company’s marketing department, my team wants to understand how annual and casual members use Cyclistic bikes differently and provide. From these insights, my team is required to develop new marketing strategies to increase the sales of annual memberships.
Cyclistic introduced a popular bike-share program in 2016. The initiative has expanded since then to include a fleet of 5,824 bicycles that are geo-tracked and locked into a system of 692 stations throughout Chicago. The bikes may be released from one station and brought back to any other station in the network. Up to this point, Cyclistic’s marketing approach focused on raising public awareness and appealing to a wide range of consumer groups. The price plans’ flexibility, which included single-ride passes, full-day passes, and annual memberships, was one strategy that assisted in making these things possible. Casual riders are those who buy one-ride or all-day passes from the company. Cyclistic members are customers who purchase annual memberships. Although the pricing flexibility helps Cyclistic attract more customers, especially casual riders, Cyclistic’s finance analysts have concluded that annual members are much more profitable. The company believes that, rather than creating a marketing campaign to attract new casual riders, it should focus on converting casual riders into members as they are already aware of the annual membership program and have chosen Cyclistic for their mobility needs. The marketing analyst team needs to have a deeper understanding of the distinctions between annual members and casual riders, the motivations behind why casual riders would purchase a membership, and the potential impact of digital media on their marketing strategies.
In phase 1, our team was asked to address three questions that will guide the future marketing program: How do annual members and casual riders use Cyclistic bikes differently? Why would casual riders buy Cyclistic annual memberships? How can Cyclistic use digital media to influence casual riders to become members? I was assigned the first question: How do annual members and casual riders use Cyclistic bikes differently? To understand the behavior of the riders, I decided to analyze how differently annual members and casual riders use the Cyclistic bikes during a year, which type of bike they prefer most, and the most popular stations.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. So the marketing analysts team is requested to understand the behavior of Cyclistic’s customers and provide solutions with recommendations with supporting data insights and professional data visualizations.
For this case study, the dataset is provided by the Cyclistic itself as CSV files. The dataset is organized according to months, and we will use one year of data from July 2021 to June 2022. As the data is sourced from the first party, the data’s credibility is excellent, licensed, and cited.
The dataset was downloaded and stored locally on a hard drive and saved as CSV files. The dataset has a total of 13 columns and 5,900,385 rows.
Source of the dataset:- https://divvy-tripdata.s3.amazonaws.com/index.html
We will use R studio for Cleaning, analyzing, and visualizing this case study. MSExcel could not handle the dataset as it is large.
For this case study, the following packages will be used.
#installing required libraries
# install.packages("tidyverse")
# install.packages("lubridate")
# install.packages("ggplot2")
# install.packages("dplyr")
# install.packages("geosphere")
# install.packages("skimr")
# install.packages("mapview")
#lording libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(dplyr)
library(geosphere)
library(skimr)
library(mapview)
#lording data into bike_share
bike_share<- list.files(path="C:/Users/akhil/OneDrive/Desktop/Case Study/data/trip_data", full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(bike_share)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at start_station_n…
## <chr> <chr> <dttm> <dttm> <chr>
## 1 0A1B62… docked_bike 2021-07-02 14:44:36 2021-07-02 15:19:58 Michigan Ave & …
## 2 B2D558… classic_bike 2021-07-07 16:57:42 2021-07-07 17:16:09 California Ave …
## 3 6F2645… classic_bike 2021-07-25 11:30:55 2021-07-25 11:48:45 Wabash Ave & 16…
## 4 379B58… classic_bike 2021-07-08 22:08:30 2021-07-08 22:23:32 California Ave …
## 5 6615C1… electric_bike 2021-07-28 16:08:06 2021-07-28 16:27:09 California Ave …
## 6 62DC2B… electric_bike 2021-07-29 17:09:08 2021-07-29 17:15:00 California Ave …
## # … with 8 more variables: start_station_id <chr>, end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>
tail(bike_share)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at start_station_n…
## <chr> <chr> <dttm> <dttm> <chr>
## 1 F1C4F6… electric_bike 2022-06-26 19:50:04 2022-06-26 19:55:23 Clarendon Ave &…
## 2 7B3B28… classic_bike 2022-06-25 00:56:48 2022-06-25 01:01:39 Sheffield Ave &…
## 3 1E9939… classic_bike 2022-06-25 00:56:25 2022-06-25 01:00:26 Sheffield Ave &…
## 4 AEA166… electric_bike 2022-06-12 12:47:12 2022-06-12 12:47:36 Milwaukee Ave &…
## 5 B9F527… classic_bike 2022-06-12 13:28:46 2022-06-12 13:53:11 Clark St & Rand…
## 6 D24131… classic_bike 2022-06-12 14:40:51 2022-06-12 15:08:14 Blue Island Ave…
## # … with 8 more variables: start_station_id <chr>, end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>
colnames(bike_share)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
dim(bike_share)
## [1] 5900385 13
str(bike_share)
## spec_tbl_df [5,900,385 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5900385] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
## $ rideable_type : chr [1:5900385] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5900385], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
## $ ended_at : POSIXct[1:5900385], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
## $ start_station_name: chr [1:5900385] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:5900385] "13001" "17660" "SL-012" "17660" ...
## $ end_station_name : chr [1:5900385] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
## $ end_station_id : chr [1:5900385] "KA1504000117" "13432" "KA1503000044" "13196" ...
## $ start_lat : num [1:5900385] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5900385] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:5900385] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5900385] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr [1:5900385] "casual" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(bike_share)
## ride_id rideable_type started_at
## Length:5900385 Length:5900385 Min. :2021-07-01 00:00:22.00
## Class :character Class :character 1st Qu.:2021-08-26 07:57:58.00
## Mode :character Mode :character Median :2021-10-27 17:35:55.00
## Mean :2021-12-12 00:11:36.51
## 3rd Qu.:2022-04-25 13:41:23.00
## Max. :2022-06-30 23:59:58.00
##
## ended_at start_station_name start_station_id
## Min. :2021-07-01 00:04:51.00 Length:5900385 Length:5900385
## 1st Qu.:2021-08-26 08:11:00.00 Class :character Class :character
## Median :2021-10-27 17:49:46.00 Mode :character Mode :character
## Mean :2021-12-12 00:31:53.47
## 3rd Qu.:2022-04-25 13:57:17.00
## Max. :2022-07-13 04:21:06.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5900385 Length:5900385 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
##
## end_lat end_lng member_casual
## Min. :41.39 Min. :-88.97 Length:5900385
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.17 Max. :-87.49
## NA's :5374 NA's :5374
skim_without_charts(bike_share)
| Name | bike_share |
| Number of rows | 5900385 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5900385 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 836018 | 0.86 | 3 | 64 | 0 | 1293 | 0 |
| start_station_id | 836015 | 0.86 | 3 | 44 | 0 | 1157 | 0 |
| end_station_name | 892103 | 0.85 | 9 | 64 | 0 | 1315 | 0 |
| end_station_id | 892103 | 0.85 | 3 | 44 | 0 | 1171 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 5374 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 |
| end_lng | 5374 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2021-07-01 00:00:22 | 2022-06-30 23:59:58 | 2021-10-27 17:35:55 | 4924385 |
| ended_at | 0 | 1 | 2021-07-01 00:04:51 | 2022-07-13 04:21:06 | 2021-10-27 17:49:46 | 4924865 |
In this stage, we will clean the data, create a few new variables for further calculation, analyze, and convert the data type into a few columns. As some of the data in the data set were in uppercase, it is better to convert all the data into lowercase to achieve consistency and avoid redundancy.
#converting data into lower case
bike_share$ride_id <- tolower(bike_share$ride_id )
bike_share$rideable_type <- tolower(bike_share$rideable_type )
bike_share$start_station_name <- tolower(bike_share$start_station_name )
bike_share$start_station_id <- tolower(bike_share$start_station_id )
bike_share$end_station_name <- tolower(bike_share$end_station_name )
bike_share$end_station_id <- tolower(bike_share$end_station_id )
bike_share$member_casual <- tolower(bike_share$member_casual)
unique(bike_share$rideable_type)
## [1] "docked_bike" "classic_bike" "electric_bike"
unique(bike_share$member_casual)
## [1] "casual" "member"
#explicitly changing the data type
bike_share$start_date <- as.Date(bike_share$started_at, format = "%m/%d/%Y")
bike_share$start_time <- as.POSIXct(bike_share$started_at,format="%H:%M:%S")
bike_share$end_date <- as.Date(bike_share$ended_at, format = "%m/%d/%Y")
bike_share$end_time <- as.POSIXct(bike_share$ended_at,format="%H:%M:%S")
#defining new Columns for weekday, month, total time, and distance rode
bike_share$day <- weekdays(as.Date(bike_share$start_date ))
bike_share$month <- months(as.Date(bike_share$start_date ))
bike_share$ride_length <-difftime(bike_share$end_time,bike_share$start_time,units = "secs")
bike_share$ride_length <- as.numeric(as.character(bike_share$ride_length))
bike_share$ride_distance <- distGeo(matrix(c(bike_share$start_lng, bike_share$start_lat), ncol=2), matrix (c(bike_share$end_lng, bike_share$end_lat), ncol=2))
bike_share$ride_distance <- bike_share$ride_distance/1000
The next step is to aggregate the data into a useful and accessible state, organize and format the data as per requirements, perform some calculations, and identify trends and relationships.
Removing rows where ride length is less than or equal to 0 since the data is inaccurate.
#removing rows with total time rode less than or equal to 0 second
bike_share <- bike_share %>%
arrange(ride_length)
bike_share_duplicate <- bike_share[!(bike_share$ride_length <=0),]
bike_share_duplicate <- bike_share_duplicate[!(bike_share_duplicate$start_lng == -73.79647698 | bike_share_duplicate$start_lat == 45.635034323),]
Removing rows with null values in the latitude and longitude of the end station.
#removing rows with null values in end_lat and end_lng
bike_share_duplicate<- bike_share_duplicate[-which(is.na(bike_share_duplicate$end_lat| bike_share_duplicate$end_lng)), ]
Creating a box plot to identify the outlier.
boxplot(bike_share_duplicate$ride_length,
ylab = "bike_share_duplicate$ride_length")
Distribution of rides over time into several sectors to understand the distribution of the ride length in second.
#plotting a pie chart to check the distribution of rides over time
bike_share_pie1 <- bike_share_duplicate %>%
mutate(ranges = cut(ride_length, c(0, 60, 300, 600, 3600, 7200, 14400, Inf))) %>%
group_by(ranges) %>%
summarize(number=n())
pie_lables1<- paste0(round(100 * bike_share_pie1$number / sum(bike_share_pie1$number), 1),"%")
pie(bike_share_pie1$number, labels =pie_lables1 , main = "Distribution of ride over time",
col = rainbow(length(bike_share_pie1$number)))
legend("topright",c( "(0,60]", "(60,300]", "(300,600]", "(600,3.6e+03]", "(3.6e+03,7.2e+03]", "(7.2e+03,1.44e+04]", "(1.44e+04,Inf]"), cex = 0.8, fill = rainbow(length(bike_share_pie1$number)))
Removing rows where ride length is less than one minute since the data is irrelevant.
#filtering out the data from 61 seconds on words in respect to ride_length
bike_share_duplicate <- subset(bike_share_duplicate, ride_length >60)
bike_share_duplicate <- bike_share_duplicate %>%
arrange(-ride_length)
#plotting a pie chart to check the distribution of rides between 61 and 3600
bike_share_pie2 <- bike_share_duplicate %>%
mutate(ranges = cut(ride_length, c(60, 300, 600, 3600, 7200, 14400, Inf))) %>%
group_by(ranges) %>%
summarize(number=n())
pie_lables2<- paste0(round(100 * bike_share_pie2$number / sum(bike_share_pie2$number), 1),"%")
pie(bike_share_pie2$number, labels = pie_lables2, main = "Distribution of ride over time after filter",
col = rainbow(length(bike_share_pie2$number)))
legend("topright",c( "(60,300]", "(300,600]", "(600,3.6e+03]", "(3.6e+03,7.2e+03]", "(7.2e+03,1.44e+04]", "(1.44e+04,Inf]"), cex = 0.8, fill = rainbow(length(bike_share_pie2$number)))
The below box plot shows outliers. But as the data is relevant, the data cannot be dropped.
boxplot(bike_share_duplicate$ride_length,
ylab = "bike_share1$ride_length")
Description of the new data set.
skim_without_charts(bike_share_duplicate)
| Name | bike_share_duplicate |
| Number of rows | 5794612 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| Date | 2 |
| numeric | 6 |
| POSIXct | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5794612 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 810999 | 0.86 | 3 | 64 | 0 | 1291 | 0 |
| start_station_id | 810996 | 0.86 | 3 | 37 | 0 | 1155 | 0 |
| end_station_name | 855317 | 0.85 | 9 | 64 | 0 | 1309 | 0 |
| end_station_id | 855317 | 0.85 | 3 | 37 | 0 | 1167 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
| day | 0 | 1.00 | 6 | 9 | 0 | 7 | 0 |
| month | 0 | 1.00 | 3 | 9 | 0 | 12 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| start_date | 0 | 1 | 2021-07-01 | 2022-06-30 | 2021-10-27 | 365 |
| end_date | 0 | 1 | 2021-07-01 | 2022-07-01 | 2021-10-27 | 366 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat | 0 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 |
| end_lng | 0 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 |
| ride_length | 0 | 1 | 1129.70 | 6444.15 | 61.00 | 389.00 | 682.00 | 1225.00 | 2946429.00 |
| ride_distance | 0 | 1 | 2.20 | 1.96 | 0.00 | 0.93 | 1.64 | 2.87 | 114.57 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2021-07-01 00:00:22 | 2022-06-30 23:59:58 | 2021-10-27 13:15:55 | 4850042 |
| ended_at | 0 | 1 | 2021-07-01 00:04:51 | 2022-07-01 18:52:41 | 2021-10-27 13:30:28 | 4850096 |
| start_time | 0 | 1 | 2021-07-01 00:00:22 | 2022-06-30 23:59:58 | 2021-10-27 13:15:55 | 4850042 |
| end_time | 0 | 1 | 2021-07-01 00:04:51 | 2022-07-01 18:52:41 | 2021-10-27 13:30:28 | 4850096 |
Creating functions to pot graphs.
#creating a function to create pie chart
pie_chart <- function(data, grouping_var, title) {
tab <- table(data[[deparse(substitute(grouping_var))]])
pie_lables<- paste0(round(100 * tab / sum(tab), 1),"%")
pie(tab, labels = pie_lables, main = title,
col = rainbow(length(tab)))
legend("topright", names(tab), cex = 0.8, fill = rainbow(length(tab)))
}
#creating a function to create bar graph
bar_graph <- function(data, grouping_var, title,xtitle, ytitle){
df <- table(data[[deparse(substitute(grouping_var))]])
barplot(df,
main=title,
xlab=xtitle,
ylab=ytitle,
border="black",
col=rainbow(7)
)
}
#creating a function to create a group graph
group_bar <- function(data, gp1, gp2, mtitle, xtitle, ytitlt, ltitle, n=1){
df1 <- data_frame(data %>%
group_by({{gp1}}, {{gp2}}) %>%
summarise(number_of_rides = n()))
if(n==0){
df1[[1]] <- ordered(df1[[1]],
levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
}
ggplot(df1,
aes(x = df1[[1]],
y = df1[[3]],
fill = df1[[2]]))+ labs(title=mtitle, x =xtitle, y =ytitlt , fill = ltitle)+
geom_bar(stat = "identity",
position = "dodge") +geom_text(aes(label = df1[[3]]),position = position_dodge(width = 1),
vjust = -0.5, size = 2) +newtheme
}
#creating a function to create line graph
line_plot <- function(data, gp1, gp2, mtitle, xtitle, ytitlt, ltitle){
df1 <- data %>%
group_by({{gp1}},{{gp2}} ) %>%
summarise(number_of_rides = n())
df1[[1]] <- factor(df1[[1]], levels= c("July", "August",
"September", "October", "November", "December", "January","February","March", "April","May","June"))
data[order(df1[[1]]), ]
ggplot(data=df1, aes(x= df1[[1]], y = df1[[3]], group =df1[[2]], color= df1[[2]] )) +
geom_line()+
geom_point()+
labs(title=mtitle,
x =xtitle, y =ytitlt , color = ltitle)+geom_text(aes(label = df1[[3]]),position = position_dodge(width = 1),
vjust = -0.5, size = 2) +newtheme
}
#creating a function to create a line graph for average distance rode
average_ride <- function(data, gp1, gp2, sumz1, mtitle, xtitle, ytitlt, ltitle, n=0){
df1<- data %>%
group_by({{gp1}}, {{gp2}})%>%
summarise(mean({{sumz1}}))
if (n==1) {
df1[[2]] <- factor(df1[[2]], levels= c("July", "August",
"September", "October", "November", "December", "January","February","March", "April","May","June"))
data[order(df1[[2]]), ]
} else {
df1[[2]]<- factor(df1[[2]], levels= c("Sunday", "Monday",
"Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
data[order(df1[[2]]), ]
}
ggplot(data=df1, aes(x= df1[[2]], y = df1[[3]], group =df1[[1]], color= df1[[1]] )) +
geom_line()+
geom_point()+
labs(title=mtitle,
x =xtitle, y =ytitlt , color = ltitle) +geom_text(aes(label = as.integer(df1[[3]])),position = position_dodge(width = 1),
vjust = -0.5, size = 2) +newtheme
}
average_ride1 <- function(data, gp1, gp2,gp3, sumz1, mtitle, xtitle, ytitlt, ltitle){
df1<- data %>%
group_by({{gp1}}, {{gp2}},{{gp3}})%>%
summarise(mean({{sumz1}}))
df1[[3]] <- factor(df1[[3]], levels= c("July", "August",
"September", "October", "November", "December", "January","February","March", "April","May","June"))
data[order(df1[[3]]), ]
ggplot(data=df1, aes(x= df1[[3]], y = df1[[4]], group =df1[[2]], color= df1[[2]] )) +
geom_line()+
geom_point()+
facet_wrap(vars(df1[[1]]))+
labs(title=mtitle,
x =xtitle, y =ytitlt , color = ltitle)+geom_text(aes(label = as.integer(df1[[4]])),position = position_dodge(width = 1),
vjust = -0.5, size = 2) +newtheme
}
total_ride <- function(data, gp1, gp2,gp3, mtitle, xtitle, ytitlt, ltitle){
df1<- data %>%
group_by({{gp1}}, {{gp2}},{{gp3}})%>%
summarise(number_of_rides = n())
df1[[3]] <- factor(df1[[3]], levels= c("July", "August",
"September", "October", "November", "December", "January","February","March", "April","May","June"))
data[order(df1[[3]]), ]
ggplot(data=df1, aes(x= df1[[3]], y = df1[[4]], group =df1[[2]], color= df1[[2]] )) +
geom_line()+
geom_point()+
facet_wrap(vars(df1[[1]]))+
labs(title=mtitle,
x =xtitle, y =ytitlt , color = ltitle)+geom_text(aes(label = as.integer(df1[[4]])),position = position_dodge(width = 1),
vjust = -0.5, size = 2) +newtheme
}
Handling missing values of the names of starting stations
#filling few missing start station names
bike_share_duplicate <- bike_share_duplicate %>%
group_by(start_lat, start_lng) %>%
arrange(start_station_name) %>%
fill(start_station_name)
Creating a few tables to calculate the total number of rides from different customer types and bike types for each Start station.
#creating a table to calculate the total number of rides for each station as per membership type and bike type
bike_share_station <- bike_share_duplicate %>%
group_by(member_casual,start_station_name)%>%
summarise(number_of_rider= n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
bike_share_station1 <- bike_share_duplicate %>%
group_by(start_station_name)%>%
summarise(number_of_rider= n())
bike_share_station_bike <- bike_share_duplicate %>%
group_by(start_station_name, rideable_type)%>%
summarise(number_of_rider= n())
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
#creating a table with the start station name and its location coordinates
bike_share_station_location<- bike_share_duplicate %>%
select(start_station_name, start_lat,start_lng)
#deleting duplicate values on the bases of start station names
bike_share_station_location <- bike_share_station_location[!duplicated(bike_share_station_location$start_station_name),]
#merging two tables
bike_share_station <- merge(x=bike_share_station,y=bike_share_station_location,by="start_station_name",all=TRUE)
bike_share_station <-bike_share_station[!is.na(bike_share_station$start_station_name),]
bike_share_station1<- merge(x=bike_share_station1,y=bike_share_station_location,by="start_station_name",all=TRUE)
bike_share_station1 <-bike_share_station1[!is.na(bike_share_station1$start_station_name),]
bike_share_station_bike<- merge(x=bike_share_station_bike,y=bike_share_station_location,by="start_station_name",all=TRUE)
bike_share_station_bike<- bike_share_station_bike[!is.na(bike_share_station_bike$start_station_name),]
bike_share_station <- bike_share_station%>%
arrange(-number_of_rider,member_casual,start_station_name)
bike_share_station1 <- bike_share_station1%>%
arrange(-number_of_rider)
bike_share_station_bike <- bike_share_station_bike%>%
arrange(-number_of_rider)
#splitting the table into two
bike_share_station_member<- bike_share_station[bike_share_station$member_casual == "member", ]
bike_share_station_causal<- bike_share_station[bike_share_station$member_casual == "casual", ]
bike_share_station_docked_bike<- bike_share_station_bike[bike_share_station_bike$rideable_type == "docked_bike", ]
bike_share_station_classic_bike<- bike_share_station_bike[bike_share_station_bike$rideable_type == "classic_bike", ]
bike_share_station_electric_bike<- bike_share_station_bike[bike_share_station_bike$rideable_type == "electric_bike", ]
Creating a custom Theme for the Graphs.
#setting up theme for graphs
newtheme <- theme_light() +
theme(plot.title = element_text(color = "#002949", face = 'bold', size =10),
panel.border = element_rect(color = "#002949", size = 1),
legend.position = "right",
legend.text = element_text(colour="red", size=8, face="bold"),
legend.title = element_text(colour="red", size=8, face="bold"),
axis.title.x = element_text(colour = "#002949"),
axis.title.y = element_text(colour = "#002949"),
axis.text.x = element_text(angle = 45, hjust = 1, color = '#002949'),
axis.text.y = element_text(angle = 45, hjust = 1, color = '#002949'),
axis.line = element_line(color = "#002949", size =1),
)
Increase the price of classic bike and electric bike for casual rides as most of the casual riders prefer this type of bike and give temporary offers on the docked bikes for an annual membership. This strategy could help the company to motivate casual members to buy annual memberships.
Even if annual members have more rides, the average time and distance are more for casual users. So increasing the cost of the hourly and daily passes could force the casual members to buy an annual pass.
Casual riders have a higher count of renting during the weekend. Increasing the rent on weekends could force them to shift to an annual membership.
Introducing a special weekend membership, which can later be converted into an annual membership, will be beneficial, as casual riders are more frequent during weekends.
Stations like Clark st & lincoln ave and Dusable lake shore dr & Monroe st are the busiest start stations for casual riders. A campaign to promote annual membership in these places would help in faster results.